Towards Unsupervised and Language-independent Compound Splitting using Inflectional Morphological Transformations
Authors
Abstract
In this paper, we address the task of language-independent, knowledge-lean and unsupervised compound splitting, which is an essential component for many natural language processing tasks such as machine translation. Previous methods for statistical compound splitting either include language-specific knowledge (e.g., linking elements) or rely on parallel data, which limits their applicability. We aim to overcome these limitations by learning compounding morphology from inflectional information derived from lemmatized monolingual corpora. In experiments on Germanic languages, we show that our approach significantly outperforms language-dependent state-of-the-art methods in finding the correct split point and that word inflection is a good approximation for compounding morphology.
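The idea of reusing inflectional transformations for compound splitting can be illustrated with a minimal sketch. It is not the authors' implementation: the restriction to suffix rewrites, the frequency-product score, the function names `learn_suffix_rules` and `split_compound`, and the toy German data are all assumptions made for illustration only.

```python
from collections import Counter


def learn_suffix_rules(pairs):
    """Count (form_suffix, lemma_suffix) rewrites from (lemma, inflected form) pairs.

    Assumption: only suffix changes are modelled; the paper may use richer transformations.
    """
    rules = Counter()
    for lemma, form in pairs:
        # Length of the longest common prefix of lemma and inflected form.
        i = 0
        while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
            i += 1
        rules[(form[i:], lemma[i:])] += 1
    return rules


def split_compound(word, vocab, rules, min_len=3):
    """Return (modifier_lemma, head) for the best two-part split, or None.

    Assumption: a split is scored by lemma frequency * head frequency * rule count.
    """
    best = None
    for k in range(min_len, len(word) - min_len + 1):
        left, head = word[:k], word[k:]
        if head not in vocab:
            continue
        for (form_suffix, lemma_suffix), count in rules.items():
            if not left.endswith(form_suffix):
                continue
            # Undo the inflection-like change on the modifier.
            stem = left[:len(left) - len(form_suffix)]
            lemma = stem + lemma_suffix
            if lemma in vocab:
                score = vocab[lemma] * vocab[head] * count
                if best is None or score > best[0]:
                    best = (score, lemma, head)
    return None if best is None else (best[1], best[2])


# Hypothetical toy data: (lemma, inflected form) pairs and lemma frequencies.
pairs = [("kind", "kinder"), ("kind", "kindes"), ("tag", "tages"), ("garten", "garten")]
vocab = {"kind": 50, "garten": 30, "tag": 40, "licht": 20}
rules = learn_suffix_rules(pairs)

print(split_compound("kindergarten", vocab, rules))  # ('kind', 'garten')
print(split_compound("tageslicht", vocab, rules))    # ('tag', 'licht')
```

In this toy setting, the plural and genitive suffixes learned from inflection ("er", "es") happen to coincide with the linking elements in the two compounds, which is the kind of overlap the abstract refers to when it calls word inflection a good approximation for compounding morphology.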
Similar resources
Language-independent compound splitting with morphological operations
Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound pa...
A Language-independent Approach to Extracting Derivational Relations from an Inflectional Lexicon
In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries with one another, and which are automatically extracted from the inflected lexicon based on su...
Unsupervised morphological parsing of Bengali
Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple, yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of ...
To stem or lemmatize a highly inflectional language in a probabilistic IR environment?
Effects of three different morphological methods (lemmatization, stemming and inflectional stem generation) for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation is done using a four-point relevance scale which is partitioned differently in different test settings. Results show that inflectional stem generation, which has not been used much in IR, compares well with lemm...
Morphological Analysis of Inflectional Compound Words in Bangla
The addition of inflectional suffixes in Bangla compound words is fairly complex. Normally, when two root words are joined, the corresponding inflectional suffix of each root word is deleted from the final compound word. In Bangla, however, the compound word’s individual root words may retain their inflectional suffixes even in the final compound word. This non-deletion of inflection creates an ...